Bounds on sample size for policy evaluation in 
Markov environments 



Leonid Peshkin and Sayan Mukherjee 

MIT Artificial Intelligence Laboratory 
545 Technology Square 
Cambridge, MA 02139 
{pesha, sayan}@ai.mit.edu

Abstract. Reinforcement learning means finding the optimal course of 
action in Markovian environments without knowledge of the environ- 
ment's dynamics. Stochastic optimization algorithms used in the field 
rely on estimates of the value of a policy. Typically, the value of a policy 
is estimated from results of simulating that very policy in the environ- 
ment. This approach requires a large amount of simulation as different 
points in the policy space are considered. In this paper, we develop value 
estimators that utilize data gathered when using one policy to estimate 
the value of using another policy, resulting in much more data-efficient
algorithms. We consider the question of accumulating sufficient experience
and give PAC-style bounds.

1 Introduction 

Research in reinforcement learning focuses on designing algorithms for an agent 
interacting with an environment, to adjust its behavior in such a way as to 
optimize a long-term return. This means searching for an optimal behavior in a 
class of behaviors. Success of learning algorithms therefore depends both on the 
richness of information about various behaviors and how effectively it is used. 
While the latter aspect has been given a lot of attention, the former aspect has
not been addressed scrupulously. This work is an attempt to adapt solutions
developed for similar problems in the field of statistical learning theory.

The motivation for this work comes from the fact that, in reality, the pro- 
cess of interaction between the learning agent and the environment is costly in 
terms of time, money or both. Therefore, it is important to carefully allocate 
available interactions, to use all available information efficiently and to have an 
estimate of how informative the experience overall is with respect to the class of 
possible behaviors. The interaction between agent and environment is modeled 
by a Markov decision process (MDP) [7, 25]. The learning system does not know
the correct behavior, or the true model of the environment it interacts with. 
Given the sensation of the environment state as an input, the agent chooses the 
action according to some rule, often called a policy. This action constitutes the 
output. The effectiveness of the action taken and its effect on the environment 
is communicated to the agent through a scalar value (reinforcement signal). 



The environment undergoes some transformation — changes the current state 
into a new state. A few important assumptions about the environment are made. 
In particular, the so-called Markov property is assumed: given the current state
and action, the next state is independent of the rest of the history of states and
actions. Another assumption is a non-deterministic environment, which means
that taking the same action in the same state could lead to a different next state 
and generate a different payoff signal. It is an objective of the agent to find a 
behavior which optimizes some long-run measure of payoff, called return. 

There are many efficient algorithms for the case when the agent has perfect 
information about the environment. An optimal policy is described by mapping 
the last observation into an action and can be computed in polynomial time in 
the size of the state and action spaces and the effective time horizon [2]. However,
in many cases the environment state is described by a vector of several
variables, which makes the environment state size exponential in the number 
of variables. Also, under more realistic assumptions, when a model of environ- 
ment dynamics is unknown and the environment's state is not observable, many 
problems arise. The optimal policy could potentially depend on the whole his- 
tory of interactions and for the undiscounted finite horizon case computing it is 
PSPACE-complete [18]. In realistic settings, the class of policies is restricted and 
even among the restricted set of policies, the absolute best policy is not expected 
to be found due to the difficulty of solving a global multi-variate optimization 
problem. Rather, the only option is to explore different approaches to finding 
near-optimal solutions among local optima. 

The issue of finding a near-optimal policy from a given class of policies is 
analogous to a similar issue in supervised learning. There we are looking for a 
near-optimal hypothesis from a given class of hypotheses [28]. However, there are 
crucial differences in these two settings. In supervised learning we assume that 
there is some target function, that labels the examples, and some distribution 
that generates examples. A crucial property is that the distribution is the same 
for all the hypotheses. This implies both that the same set of samples can be 
evaluated on any hypothesis, and that the observed error is a good estimate of 
the true error. 

On the other hand, there is no fixed distribution generating experiences in 
reinforcement learning. Each policy induces a different distribution over expe- 
riences. The choice of a policy defines both a "hypothesis" and a distribution. 
This raises the question of how one re-uses the experience obtained while follow- 
ing one policy to learn about another. The other policy might generate a very 
different set of samples (experiences), and in the extreme case the support of the 
two distributions might be disjoint. 

In the pioneering work by Kearns et al. [9], the issue of generating enough
information to determine the near-best policy is considered. Using a random 
policy (selecting actions uniformly at random), they generate a set of history 
trees. This information is used to define estimates that uniformly converge to 
the true values. However, this work relies on having a generative model of the
environment, which allows simulating a reset of the environment to any state
and executing any action to sample an immediate reward. Also, the reuse of
information is partial: an estimate of a policy value is built only on a subset of
experiences, "consistent" with the estimated policy.

Mansour [11] has addressed the issue of computational complexity in the 
setting of Kearns et al. [9] by establishing a connection between mistake bounded 
algorithms (adversarial on-line model [10]) and computing a near-best policy 
from a given class with respect to a finite-horizon return. Access to an algorithm 
that learns the policy class with some maximal permissible number of mistakes 
is assumed. This algorithm is used to generate "informative" histories in the 
POMDP, following various policies in the class, and determine a near-optimal 
policy. In this setting a few improvements in bounds are made. 

In this work we present a way of reusing all of the accumulated experience 
without having access to a generative model of the environment. We make use of
the technique known to different communities as "importance sampling" [26] or
"likelihood ratio estimation" [5]. We discuss properties of different estimators
and provide bounds for the uniform convergence of estimates on the policy class. 
We suggest a way of using these bounds to select among candidate classes of
policies of varying complexity, similar to structural risk minimization [28].

The rest of this paper is organized as follows. Section 2 discusses reinforce- 
ment learning as a stochastic optimization problem. In Section 3 we define our 
notation. Section 4 presents the necessary background in sampling theory and 
presents our way of estimating the value of policies. The algorithm and PAC-style 
bounds are given in Section 5. 

2 Reinforcement Learning as Stochastic Optimization 

There are various approaches to solving RL problems. Value search algorithms
find the optimal policy by using dynamic-programming methods to compute 
the value function — utility of taking a particular action in a particular world 
state — then deducing the optimal policy from the value function. Policy search 
algorithms (e.g. REINFORCE [30]) work directly in the policy class, trying to max- 
imize the expected reward without the help of Bellman's optimality principle. 

Policy search methods rely on estimating the value of the policy (or the 
gradient of the value) at various points in a policy class and attempt to solve the 
optimization issue. In this paper we ignore the optimization issue and concentrate
on the estimation issue: how much and what kind of experience one needs to
generate in order to be able to construct uniformly good value estimators over 
the whole policy class. In particular we would like to know what the relation is 
between the number of sample experiences and the confidence of value estimates 
across the policy class. 

Two different approaches to optimization can be taken. One involves using an 
algorithm, driven by newly generated policy value (or gradient thereof) estimates 
at each iteration to update the hypothesis about the optimal policy after each 
interaction (or few interactions) with the environment. We will call this on-line 
optimization. Another is to postpone optimization until all possible interaction 



with the environment is exhausted, and combine all information available in
order to estimate (off-line) the whole "value surface".

In this paper we are not concerned with the question of optimization. We 
concentrate on the second case with the goal of building a module that contains 
a non-parametric model of the optimization surface. Given an arbitrary policy, such
a module outputs an estimate of its value, as if the policy was tried out in
the environment. Once such a module is built and guarantees on good estimates
of policy value are obtained across the policy class, we may use our favorite 
optimization algorithm. Gradient descent methods, in particular REINFORCE [30, 
31] have been used recently in conjunction with policy classes constrained in 
various ways, e.g., with external memory [21], finite state controllers [14] and in 
multi-agent settings [20]. Furthermore, the idea of using importance sampling
in reinforcement learning has been explored [13, 24]. However, only on-line
optimization was considered.

One realistic off-line scenario in reinforcement learning is when the data 
processing and optimization (learning) module is separated (physically) from 
the data acquisition module (agent). Say we have an ultra-light micro-sensor
connected to a central computer. The agent then has to be instructed initially
how to behave when given a chance to interact with the environment for a
limited number of times, then bring or transmit the collected data back. Naturally,
during such limited interaction only a few possible behaviors can be tried out. 
It is extremely important to be able to generalize from this experience in order 
to make a judgment about the quality of behaviors which were not tried out. 
This is possible when some kind of similarity measure in the policy class can 
be established. If the difference between the values of two policies could be 
estimated, we could estimate the value of one policy based on experience with 
the other. 

3 Background and Notation 

MDP The class of problems described above can be modeled as Markov decision
processes (MDPs). An MDP is a 4-tuple $(S, A, T, R)$, where: $S$ is the set of states;
$A$ is the set of actions; $T : S \times A \to \mathcal{P}(S)$ is a mapping from states of the
environment and actions of the agent to probability distributions¹ over states
of the environment; and $R : S \times A \to \mathbb{R}$ is the payoff function (reinforcement),
mapping states of the environment and actions of the agent to immediate reward.

POMDP The more complex case is when the agent is no longer able to reliably
determine which state of the MDP it is currently in. The process of generating an
observation is modeled by an observation function $B(s(t))$. The resulting model
is a partially observable Markov decision process (POMDP). Formally, a POMDP is
defined as a tuple $(S, O, A, B, T, R)$ where: $S$ is the set of states; $O$ is the set of
observations; $A$ is the set of actions; $B$ is the observation function $B : S \to \mathcal{P}(O)$;

¹ Let $\mathcal{P}(\Omega)$ denote the set of probability distributions defined on some space $\Omega$.



$T : S \times A \to \mathcal{P}(S)$ is a mapping from states of the environment and actions of the
agent to probability distributions over states of the environment; $R : S \times A \to \mathbb{R}$
is the payoff function, mapping states of the environment and actions of the
agent to immediate reward. In a POMDP, at each time step an agent observes
$o(t)$ corresponding to $B(s(t))$ and performs an action $a(t)$ according to its policy,
inducing a state transition of the environment; then receives the reward $r(t)$. We
assume that the rewards $r(s, a)$ are bounded by $r_{max}$ for any $s$ and $a$.

History We denote by $H_t$ the set of all possible experience sequences of length
$t$: $H_t = \{(o(1), a(1), r(1), \ldots, o(t), a(t), r(t), o(t+1))\}$, where $o(t) \in O$ is the
observation of the agent at time $t$; $a(t) \in A$ is the action the agent has chosen to take
at time $t$; and $r(t) \in \mathbb{R}$ is the reward received by the agent at time $t$. In order to
specify that some element is a part of the history $h$ at time $\tau$, we write, for
example, $r(\tau, h)$ and $a(\tau, h)$ for the $\tau$-th reward and action in the history $h$.

Policy Generally speaking, in a POMDP, a policy $\pi$ is a rule specifying the
action to perform at each time step as a function of the whole previous history:
$\pi : H \to \mathcal{P}(A)$. Policy class $\Pi$ is any set of policies. We assume that the
probability of the elementary event is bounded away from zero: $0 < c < \Pr(a \mid h, \pi)$,
for any $a \in A$, $h \in H$ and $\pi \in \Pi$.

Return A history $h$ includes several immediate rewards $(r(1), \ldots, r(t), \ldots)$, that
can be combined to form a return $R(h)$. In this paper we focus on returns which
may be computed (or approximated) using the first $T$ steps, and are bounded
in absolute value by $R_{max}$. This includes two well-studied return functions: the
undiscounted finite-horizon return and the discounted infinite-horizon return.
The first is $R(h) = \sum_{t=1}^{T} r(t, h)$, where $T$ is the finite-horizon length. In this
case $R_{max} = T r_{max}$. The second is the discounted infinite-horizon return [25]
$R(h) = \sum_{t=0}^{\infty} \gamma^t r(t, h)$, with geometric discounting by the factor $\gamma \in (0, 1)$.
In this case we can approximate $R$ using the first $T_\epsilon = \log_\gamma \frac{\epsilon (1 - \gamma)}{r_{max}}$ immediate
rewards. Using $T_\epsilon$ steps we can approximate $R$ within $\epsilon$ since $R_{max} = \frac{r_{max}}{1 - \gamma}$ and
$\sum_{t=0}^{\infty} \gamma^t r(t) - \sum_{t=0}^{T_\epsilon} \gamma^t r(t) < \epsilon$. It is important to approximate the return in $T$
steps, since the length of the horizon is a parameter in our bounds.
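
As a hedged illustration of the truncation argument above, the following sketch computes a number of steps after which the remaining discounted reward is guaranteed to stay below a tolerance. The function names and the rounding up to an integer number of steps are assumptions of this sketch, not part of the original text.

```python
import math


def effective_horizon(gamma: float, eps: float, r_max: float) -> int:
    """Smallest integer T with gamma**T * r_max / (1 - gamma) <= eps.

    Rounding up to an integer is an assumption of this sketch.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if eps <= 0.0 or r_max <= 0.0:
        raise ValueError("eps and r_max must be positive")
    return_bound = r_max / (1.0 - gamma)  # R_max for the discounted return
    if eps >= return_bound:
        return 0
    # log base gamma of eps * (1 - gamma) / r_max, rounded up
    return math.ceil(math.log(eps / return_bound) / math.log(gamma))


def tail_bound(gamma: float, horizon: int, r_max: float) -> float:
    """Upper bound on the reward mass after `horizon`: sum_{t > horizon} gamma**t * r_max."""
    return gamma ** (horizon + 1) * r_max / (1.0 - gamma)


horizon = effective_horizon(gamma=0.9, eps=0.01, r_max=1.0)
print(horizon, tail_bound(0.9, horizon, 1.0))
```

The horizon grows only logarithmically in $1/\epsilon$, which is why the discounted case can be treated with the same finite-horizon machinery.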

Value Any policy $\pi \in \Pi$ defines a conditional distribution $\Pr(h \mid \pi)$ on the class
of all histories $H$. The value of policy $\pi$ is the expected return according to the
probability induced by this policy on the history space:

$$V(\pi) = E_\pi [R(h)] = \sum_{h \in H} R(h) \Pr(h \mid \pi) ,$$

where for brevity we introduced the notation $E_\pi$ for $E_{\Pr(h \mid \pi)}$. It is an objective of the
agent to find a policy $\pi^*$ with optimal value: $\pi^* = \arg\max_\pi V(\pi)$. We assume
that policy value is bounded by $V_{max}$. That means of course that returns are
also bounded by $V_{max}$ since value is a weighted sum of returns.



4 Sampling 



For the sake of clarity we introduce concepts from sampling theory using
functions and notation for relevant reinforcement learning concepts.
Rubinstein [26] provides a good overview of this material.

"Crude" sampling If we need to estimate the value $V(\pi)$ of policy $\pi$, from
independent, identically distributed (i.i.d.) samples induced by this policy, after
taking $N$ samples $h_i$, $i \in (1..N)$ we have:

$$\hat{V}(\pi) = \frac{1}{N} \sum_i R(h_i) .$$

The expected value of this estimator is $V(\pi)$ and it has variance

$$\mathrm{Var}[\hat{V}(\pi)] = \frac{1}{N} \left[ \sum_{h \in H} R(h)^2 \Pr(h \mid \pi) - V(\pi)^2 \right] .$$

Indirect sampling Imagine now that for some reason we are unable to sample
from the policy $\pi$ directly, but instead we can sample from another policy $\pi'$.
The intuition is that if we knew how "similar" those two policies were to one
another, we could use samples drawn according to the distribution $\pi'$ and make
an adjustment according to the similarity of the policies. Formally we have:

$$V(\pi) = \sum_{h \in H} R(h) \Pr(h \mid \pi) = \sum_{h \in H} R(h) \frac{\Pr(h \mid \pi)}{\Pr(h \mid \pi')} \Pr(h \mid \pi') = E_{\pi'} \left[ R(h) \frac{\Pr(h \mid \pi)}{\Pr(h \mid \pi')} \right] ,$$

where an agent might not be (and most often is not) able to calculate $\Pr(h \mid \pi)$.



Lemma 1. It is possible to calculate $\frac{\Pr(h \mid \pi)}{\Pr(h \mid \pi')}$ for any $\pi, \pi' \in \Pi$ and $h \in H$.

Proof. The Markov assumption in POMDPs warrants that

$$\begin{aligned}
\Pr(h \mid \pi) &= \Pr(s(0)) \prod_{t=1}^{T} \Pr(o(t) \mid s(t)) \Pr(a(t) \mid o(t), \pi) \Pr(s(t+1) \mid s(t), a(t)) \\
&= \left[ \Pr(s(0)) \prod_{t=1}^{T} \Pr(o(t) \mid s(t)) \Pr(s(t+1) \mid s(t), a(t)) \right] \left[ \prod_{t=1}^{T} \Pr(a(t) \mid o(t), \pi) \right] \\
&= \Pr(h_e) \Pr(h_a \mid \pi) .
\end{aligned}$$

$\Pr(h_e)$ is the probability of the part of the history, dependent on the environment,
that is unknown to the agent and can only be sampled. $\Pr(h_a \mid \pi)$ is the probability
of the part of the history, dependent on the agent, that is known to the agent
and can be computed (and differentiated). Therefore we can compute

$$\frac{\Pr(h \mid \pi)}{\Pr(h \mid \pi')} = \frac{\Pr(h_e) \Pr(h_a \mid \pi)}{\Pr(h_e) \Pr(h_a \mid \pi')} = \frac{\Pr(h_a \mid \pi)}{\Pr(h_a \mid \pi')} .$$

□
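
The cancellation of the environment terms can be illustrated with a minimal sketch: the likelihood ratio of a history is computed from the recorded (observation, action) pairs alone. The type aliases, function names and the two example policies are hypothetical and introduced only for illustration; the policies are assumed to be reactive (they depend on the current observation only) and to assign non-zero probability to every action.

```python
import math
from typing import Callable, Hashable, Sequence, Tuple

# Hypothetical representation: a policy maps (observation, action) to a probability.
Policy = Callable[[Hashable, Hashable], float]
Step = Tuple[Hashable, Hashable]  # (observation, action)


def log_action_likelihood(policy: Policy, steps: Sequence[Step]) -> float:
    """log Pr(h_a | policy): sum of log action probabilities along the history."""
    return sum(math.log(policy(obs, action)) for obs, action in steps)


def likelihood_ratio(target: Policy, sampling: Policy, steps: Sequence[Step]) -> float:
    """Pr(h | target) / Pr(h | sampling); the environment factor Pr(h_e) cancels."""
    return math.exp(log_action_likelihood(target, steps)
                    - log_action_likelihood(sampling, steps))


def uniform_policy(obs: Hashable, action: Hashable) -> float:
    """Example policy: two actions chosen uniformly at random."""
    return 0.5


def matching_policy(obs: Hashable, action: Hashable) -> float:
    """Example policy: prefers the action that equals the observation."""
    return 0.8 if action == obs else 0.2


history = [(0, 0), (1, 0), (1, 1)]
print(likelihood_ratio(matching_policy, uniform_policy, history))
```

Working in log space is a design choice of the sketch: for long horizons the product of many probabilities can underflow, while the sum of logarithms stays well behaved.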



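We can now construct an indirect estimator $\hat{V}_{\pi'}(\pi)$ from i.i.d. samples $h_i$, $i \in
(1..N)$ according to the distribution $\Pr(h \mid \pi')$:

$$\hat{V}_{\pi'}(\pi) = \frac{1}{N} \sum_i R(h_i) w_\pi(h_i, \pi') , \qquad (1)$$

where for convenience, we denote the fraction $\frac{\Pr(h \mid \pi)}{\Pr(h \mid \pi')}$ by $w_\pi(h, \pi')$. This is an
unbiased estimator of $V(\pi)$ with variance

$$\begin{aligned}
\mathrm{Var}[\hat{V}_{\pi'}(\pi)] &= \frac{1}{N} \left[ \sum_{h \in H} (R(h) w_\pi(h, \pi'))^2 \Pr(h \mid \pi') - V(\pi)^2 \right] \\
&= \frac{1}{N} \left[ \sum_{h \in H} \frac{(R(h) \Pr(h \mid \pi))^2}{\Pr(h \mid \pi')} - V(\pi)^2 \right] \qquad (2) \\
&= \frac{1}{N} E_\pi \left[ R(h)^2 w_\pi(h, \pi') \right] - \frac{1}{N} V(\pi)^2 .
\end{aligned}$$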
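
The unbiasedness claim and the variance expression can be checked exactly on a toy problem. The sketch below is an illustration under strong assumptions that are not in the original text: histories are reduced to binary action sequences (the environment factor is taken to be 1), and both policies choose action 1 with a fixed probability at every step. All names are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def sequence_prob(actions: Sequence[int], p_one: float) -> float:
    """Probability of a binary action sequence when action 1 has probability p_one."""
    prob = 1.0
    for a in actions:
        prob *= p_one if a == 1 else 1.0 - p_one
    return prob


def exact_is_moments(horizon: int, p_target: float, p_sampling: float,
                     ret: Callable[[Sequence[int]], float],
                     n_samples: int) -> Tuple[float, float, float]:
    """Enumerate all histories and return (V(pi), E_{pi'}[estimate], Var[estimate])."""
    value = mean = second_moment = 0.0
    for h in product((0, 1), repeat=horizon):
        p_t = sequence_prob(h, p_target)
        p_s = sequence_prob(h, p_sampling)
        w = p_t / p_s                      # likelihood ratio w_pi(h, pi')
        value += ret(h) * p_t              # true value under the target policy
        mean += ret(h) * w * p_s           # expectation of one reweighted sample
        second_moment += (ret(h) * w) ** 2 * p_s
    variance = (second_moment - value ** 2) / n_samples
    return value, mean, variance


def all_ones(h: Sequence[int]) -> float:
    """Return that is non-zero only on a history that is rare under the target policy."""
    return 1.0 if all(h) else 0.0


on_policy = exact_is_moments(3, 0.2, 0.2, all_ones, n_samples=100)
off_policy = exact_is_moments(3, 0.2, 0.5, all_ones, n_samples=100)
print(on_policy, off_policy)
```

In this example the return is concentrated on a history that the target policy rarely visits, so sampling from a different policy gives the same expectation with a smaller variance than sampling from the target policy itself.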

This estimator $\hat{V}_{\pi'}(\pi)$ is usually called in statistics [26] an importance
sampling (IS) estimator because the probability $\Pr(h \mid \pi')$ is chosen to emphasize parts
of the sampled space that are important in estimating $V$. The technique of IS was
originally designed to increase the accuracy of Monte Carlo estimates by
reducing their variance [26]. Variance reduction is always a result of exploiting some
knowledge about the estimated quantity.

Optimal sampling policy It can be shown [8], for example by optimizing
expression 2 with Lagrange multipliers, that the optimal sampling distribution is
$\Pr(h \mid \pi') = \frac{R(h) \Pr(h \mid \pi)}{V(\pi)}$, which gives an estimator with zero variance. Not
surprisingly, this distribution cannot be used, since it depends on prior knowledge of a
model of the environment (transition probabilities, reward function), which
contradicts our assumptions, and on the value of the policy, which is what we need
to calculate. However, all is not lost. There are techniques which approximate
the optimal distribution by changing the sampling distribution during the trial,
while keeping the resulting estimates unbiased via reweighting of samples, called
"adaptive importance sampling" and "effective importance sampling" (see, for
example, [15, 32, 17]). In the absence of any information about $R(h)$ or the estimated
policy, the optimal sampling policy is the one which selects actions uniformly at
random: $\Pr(a \mid h) = \frac{1}{|A|}$. For the horizon $T$, this gives us the upper bound which
we denote $\eta$:

$$w_\pi(h, \pi') \le 2^T (1 - c)^T = \eta . \qquad (3)$$
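
To make the size of this bound concrete, the sketch below evaluates $\eta$ and compares it with the largest likelihood ratio found by brute-force enumeration. It assumes two actions (which is what the factor $2^T$ appears to correspond to) and a target policy that picks action 1 with a fixed probability; the function names are hypothetical.

```python
from itertools import product


def ratio_bound(horizon: int, c: float) -> float:
    """eta = 2**T * (1 - c)**T for uniform sampling over two actions."""
    if not 0.0 < c <= 0.5:
        raise ValueError("with two actions, c must lie in (0, 0.5]")
    return (2.0 * (1.0 - c)) ** horizon


def worst_ratio(horizon: int, p_one: float) -> float:
    """Largest likelihood ratio over all binary action sequences when the target
    policy picks action 1 with probability p_one and sampling is uniform."""
    worst = 0.0
    for h in product((0, 1), repeat=horizon):
        ratio = 1.0
        for a in h:
            ratio *= (p_one if a == 1 else 1.0 - p_one) / 0.5
        worst = max(worst, ratio)
    return worst


print(ratio_bound(5, 0.1), worst_ratio(5, 0.9), worst_ratio(5, 0.7))
```

The bound is attained by a policy that plays its most likely action with probability $1 - c$ at every step, and it grows exponentially with the horizon unless $c$ is close to $1/2$.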

Remark. One interesting observation is that it is possible to get a better
estimate of $V(\pi)$ while following another policy $\pi'$. Here is an illustrative example:
imagine that the reward function $R(h)$ is such that it is zero for all histories in some
sub-space $H_0$ of the history space $H$. At the same time the policy $\pi$ which we are
trying to estimate spends almost all the time there, in $H_0$. If we follow $\pi$ in our
exploration, we are wasting samples and time. In this case, we can really call what
happens "importance sampling", unlike the usual case when it is just "reweighting",
not connected to "importance" per se. That is why we advocate using the name
"likelihood ratio" rather than "importance sampling".



Remark. So far, we talked about using a single policy to collect all samples
for estimation. We also made an assumption that all considered distributions
have equal support. In other words, we assumed that any history has a
non-zero probability to be induced by any policy. Obviously it could be beneficial to
execute a few different sampling policies, which might have disjoint or
overlapping support. There is literature on this so-called stratified sampling
technique [26]. Here we just mention that it is possible to extend our analysis by
introducing a prior probability on choosing a policy out of a set of sampling
policies, then executing this sampling policy. Our sampling probability will
become: $\Pr(h) = \Pr(\pi') \Pr(h \mid \pi')$.

5 Algorithm and Bounds 

Table 1 presents the computational procedure for estimating the value of any
policy from the policy class off-line. The sampling stage consists of accumulating
histories $h_i$, $i \in [1..N]$ induced by a sampling policy $\pi'$ and calculating returns on
these histories $R(h_i)$. After the first stage is done, the procedure can simulate the
interaction with the environment for any policy search algorithm, by returning
an estimate for an arbitrary policy.

Table 1. Policy evaluation

Sampling stage:
    Choose a sampling policy $\pi'$;
    Accumulate the set of histories $h_i$, $i \in [1..N]$ induced by $\pi'$;
    Calculate the set of returns $R(h_i)$, $i \in [1..N]$;
Estimation stage:
    Input: policy $\pi \in \Pi$
    Calculate $w_\pi(h_i, \pi')$ for $i \in [1..N]$;
    Output: estimate $\hat{V}(\pi)$ according to equation 1: $\frac{1}{N} \sum_i R(h_i) w_\pi(h_i, \pi')$
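
The two-stage procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two-state toy environment, the class name `PolicyEvaluator` and the example policies are all hypothetical, the policies are assumed to be reactive, and every action is assumed to have non-zero probability under both policies.

```python
import math
import random
from typing import Callable, List, Tuple

Policy = Callable[[int, int], float]             # (observation, action) -> probability
History = Tuple[List[Tuple[int, int]], float]    # ((observation, action) steps, return)


def uniform_policy(obs: int, action: int) -> float:
    return 0.5


def matching_policy(obs: int, action: int) -> float:
    return 0.9 if action == obs else 0.1


def sample_history(policy: Policy, horizon: int, rng: random.Random) -> History:
    """Hypothetical toy environment: the state is observed directly, the reward is 1
    when the action equals the state, and the next state follows the action w.p. 0.9."""
    state, steps, ret = 0, [], 0.0
    for _ in range(horizon):
        action = 1 if rng.random() < policy(state, 1) else 0
        steps.append((state, action))
        ret += 1.0 if action == state else 0.0
        state = action if rng.random() < 0.9 else 1 - action
    return steps, ret


class PolicyEvaluator:
    def __init__(self, sampling_policy: Policy, horizon: int, n_samples: int, seed: int = 0):
        # Sampling stage: accumulate histories and their returns under the sampling policy.
        rng = random.Random(seed)
        self.sampling_policy = sampling_policy
        self.data = [sample_history(sampling_policy, horizon, rng) for _ in range(n_samples)]

    def estimate(self, policy: Policy) -> float:
        # Estimation stage: reweight each stored return by its likelihood ratio.
        total = 0.0
        for steps, ret in self.data:
            log_w = sum(math.log(policy(o, a)) - math.log(self.sampling_policy(o, a))
                        for o, a in steps)
            total += ret * math.exp(log_w)
        return total / len(self.data)


evaluator = PolicyEvaluator(uniform_policy, horizon=4, n_samples=20000, seed=1)
print(evaluator.estimate(matching_policy))
```

In this toy environment the reward at each step is 1 with probability 0.9 under `matching_policy`, so its true value over four steps is 3.6, which the estimate can be compared against. Once the sampling stage is done, `estimate` can be called for any number of candidate policies without further interaction with the environment.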



5.1 Sample Complexity 

We first compute deviation bounds for the IS estimator for a single policy from
its expectation using Bernstein's inequality:

Theorem (Bernstein [3]). Let $\xi_1, \xi_2, \ldots$ be independent random variables with
identical mean $E\xi$, bounded by some constant $|\xi_i| \le a$, $a > 0$. Also let $\mathrm{Var}(M_N) =
E\xi_1^2 + \cdots + E\xi_N^2 \le L$. Then the partial sums $M_N = \xi_1 + \cdots + \xi_N$ obey the following
inequality for all $\epsilon > 0$:

$$\Pr \left( \left| \frac{1}{N} M_N - E\xi \right| > \epsilon \right) \le 2 \exp \left( - \frac{1}{2} \frac{\epsilon^2 N}{L + a \epsilon} \right) .$$


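Lemma 2. With probability $(1 - \delta)$ the following holds true. The estimated value
$\hat{V}(\pi)$ based on $N$ samples is close to the true value $V(\pi)$ for some policy $\pi$:

$$\left| V(\pi) - \hat{V}(\pi) \right| \le \frac{V_{max}}{N} \left( \log(1/\delta) \eta + \sqrt{2 \log(1/\delta) (\eta - 1) + \log(1/\delta)^2 \eta^2} \right) .$$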



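Proof. In our setup, $\xi_i = R(h_i) w_\pi(h_i, \pi')$, and $E\xi = E_{\pi'} [R(h_i) w_\pi(h_i, \pi')] =
E_\pi [R(h_i)] = V(\pi)$; and $a = V_{max} \eta$ by equation 3. According to equation 2,
$L = \mathrm{Var}(M_N) = \mathrm{Var} \hat{V}_{\pi'}(\pi) \le \frac{V_{max}^2}{N} (\eta - 1)$. So we can use Bernstein's inequality
and we get the following deviation bound for a policy $\pi$:

$$\Pr \left( \left| V(\pi) - \hat{V}(\pi) \right| > \epsilon \right) \le 2 \exp \left[ - \frac{1}{2} \frac{\epsilon^2 N}{\frac{V_{max}^2 (\eta - 1)}{N} + V_{max} \eta \epsilon} \right] . \qquad (4)$$

After solving for $\epsilon$, we get the statement of Lemma 2. □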
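
The "solving for $\epsilon$" step can be checked numerically. The sketch below implements the deviation bound in the form $\frac{V_{max}}{N} \left( \log(1/\delta) \eta + \sqrt{2 \log(1/\delta) (\eta - 1) + \log(1/\delta)^2 \eta^2} \right)$ and confirms that this $\epsilon$ makes the exponential term $\exp\left[-\frac{1}{2} \epsilon^2 N / (L + a \epsilon)\right]$, with $L = V_{max}^2 (\eta - 1) / N$ and $a = V_{max} \eta$, equal to $\delta$. The function names are hypothetical, and the check concerns only the exponential term, not the leading constant of the inequality.

```python
import math


def lemma2_deviation(n_samples: int, delta: float, eta: float, v_max: float) -> float:
    """Deviation bound for a single policy, in the form stated in the lead-in."""
    if n_samples <= 0 or not 0.0 < delta < 1.0 or eta < 1.0 or v_max <= 0.0:
        raise ValueError("need n_samples > 0, 0 < delta < 1, eta >= 1, v_max > 0")
    log_term = math.log(1.0 / delta)
    root = math.sqrt(2.0 * log_term * (eta - 1.0) + (log_term * eta) ** 2)
    return v_max / n_samples * (log_term * eta + root)


def bernstein_exponential(eps: float, n_samples: int, eta: float, v_max: float) -> float:
    """exp(-(1/2) eps^2 N / (L + a eps)) with L = V_max^2 (eta - 1) / N, a = V_max eta."""
    variance_bound = v_max ** 2 * (eta - 1.0) / n_samples
    range_bound = v_max * eta
    return math.exp(-0.5 * eps ** 2 * n_samples / (variance_bound + range_bound * eps))


eps = lemma2_deviation(n_samples=1000, delta=0.05, eta=16.0, v_max=1.0)
print(eps, bernstein_exponential(eps, 1000, 16.0, 1.0))
```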



Note that this result is for a single policy. We need a convergence result
simultaneously for all policies in the class $\Pi$. We proceed using classical uniform
convergence results for covering numbers as a measure of complexity.

Remark. We use covering numbers (instead of the VC dimension, as in Kearns et al. [9])
both as a measure of the metric complexity of a policy class in a union bound 
and as a parameter for bounding the likelihood ratio. Another advantage is that 
metric entropy is a more refined measure of capacity than VC dimension since 
the VC dimension is an upper bound on the growth function which is an upper 
bound on the metric entropy [28]. 

Definitions. Let $\Pi$ be a class of policies that form a metric space with metric
$D_\infty(\pi, \pi')$ and $\epsilon > 0$. The covering number $\mathcal{N}(\Pi, D, \epsilon)$ is defined as the minimal
integer $l$ such that there exist $l$ disks in $\Pi$ with radius $\epsilon$ covering $\Pi$. If no such
partition exists for some $\epsilon > 0$ then the covering number is infinite. The metric
entropy is defined as $\mathcal{K}(\Pi, D, \epsilon) = \log \mathcal{N}(\Pi, D, \epsilon)$.

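Theorem 4. With probability $1 - \delta$ the difference $|V(\pi) - \hat{V}(\pi)|$ is less than $\epsilon$
simultaneously for all $\pi \in \Pi$ for the sample size:

$$N = O \left( \frac{V_{max}}{\epsilon} 2^T (1 - c)^T \left( \log(1/\delta) + \mathcal{K} \right) \right) .$$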



Proof. Given a class of policies $\Pi$ with finite covering number $\mathcal{N}(\Pi, D, \epsilon)$, the
upper bound $\eta = 2^T (1 - c)^T$ on the likelihood ratio, and $\epsilon > 0$,

$$\Pr \left( \sup_{\pi \in \Pi} \left| V(\pi) - \hat{V}(\pi) \right| > \epsilon \right) \le 8 \mathcal{N} \left( \Pi, D, \frac{\epsilon}{8} \right) \exp \left[ - \frac{1}{128} \frac{\epsilon^2 N}{\frac{V_{max}^2 (\eta - 1)}{N} + \frac{V_{max} \eta \epsilon}{8}} \right] .$$

Note the relationship to equation 4. The only essential difference is in the
covering number, which takes into account the extension from a single policy $\pi$ to the
class $\Pi$. This requires the sample size $N$ to increase accordingly to achieve the
given confidence level. The derivation is similar to the uniform convergence result of
Pollard [22] (see pages 24-27), using Bernstein's inequality instead of Hoeffding's.
Solving for $N$ gives us the statement of the theorem. □

Let us compare our result with a similar result for the algorithm by Kearns et
al. [9]:

$$N = O \left( \left( \frac{V_{max}}{\epsilon} \right)^2 2^{2T} \, VC(\Pi) \log(T) \left( T + \log(V_{max}/\epsilon) + \log(1/\delta) \right) \right) . \qquad (5)$$

Both dependences are exponential in the horizon; however, in our case the
dependence on $V_{max}/\epsilon$ is linear rather than quadratic. The metric entropy $\log(\mathcal{N})$
takes the place of the VC dimension $VC(\Pi)$ in terms of class complexity. This
reduction in sample size could be explained by the fact that our algorithm
uses all trajectories for evaluation of any policy, while that of Kearns et al. uses just
a subset of trajectories.

Remark. Let us note that a weaker bound, which is remarkably similar to
equation 5, could be obtained [19] using McDiarmid's theorem [12], applicable to
a more general case:

$$N = O \left( \left( \frac{V_{max}}{\epsilon} \right)^2 2^{2T} (1 - c)^{2T} \left( \mathcal{K} + \log(1/\delta) \right) \right) .$$

The proof is based on the fact that replacing one history $h_i$ in the set of samples
$h_i$, $i \in (1..N)$ for the estimator $\hat{V}_{\pi'}(\pi)$ of equation 1 cannot change the value of
the estimator by more than $\frac{V_{max} \eta}{N}$.



5.2 Bounding the Likelihood Ratio 

We would like to find a way to estimate a policy which minimizes sample
complexity. Remember that we are free to choose a sampling policy. We have
discussed what it means for one sampling policy $\pi$ to be optimal with respect to
another. Here we would like to consider what it means for a sampling policy $\pi$
to be optimal with respect to a policy class $\Pi$. Choosing the optimal sampling
policy allows us to improve bounds with regard to exponential dependence on
the horizon $T$. The idea is that if we are working with a policy class of a
finite metric dimension, the likelihood ratio can be upper bounded through the
covering number due to the limit in combinatorial choices. The trick is to
consider sample complexity for the case of the sampling policy being optimal in the
information-theoretic sense.

This derivation is very similar to the one of an upper bound on the mini- 
max regret for predicting probabilities under logarithmic loss [4, 16]. The upper 
bounds on logarithmic loss we use were first obtained by Opper and Haussler [16] 
and then generalized by Cesa-Bianchi and Lugosi [4]. The result of Cesa-Bianchi
and Lugosi is more directly related to the reinforcement learning problem since it 
applies to the case of arbitrary rather than static experts, which corresponds to 
learning a policy. First, we describe the sequence prediction problem and result 
of Cesa-Bianchi and Lugosi, then show how to use this result in our setup. 

In a sequential prediction game $T$ symbols $h_a^T = (a(1), \ldots, a(T))$ are
observed sequentially. After each observation $a(t - 1)$, a learner is asked how likely
it is for each value $a \in A$ to be the next observation. The learner's goal is to
assign a probability distribution $\Pr(a(t) \mid h_a^{t-1}; \pi')$ based on the previous
values. When at the next time step $t$ the actual new observation $a(t)$ is revealed,
the learner suffers a loss $-\log \Pr(a(t) \mid h_a^{t-1}; \pi')$. At the end of the game, the
learner has suffered a total loss $-\sum_{t=1}^{T} \log \Pr(a(t) \mid h_a^{t-1}; \pi')$. Using the joint
distribution $\Pr(h_a^T \mid \pi') = \prod_{t=1}^{T} \Pr(a(t) \mid h_a^{t-1}; \pi')$ we are going to write the loss as
$-\log \Pr(h_a^T \mid \pi')$. When it is known that the sequences $h_a^T$ are generated by some
probability distribution $\pi$ from the class $\Pi$, we might ask what is the worst
regret: the difference in the loss between the learner and the best expert in the
target class $\Pi$ on the worst sequence:

$$R_T = \inf_{\pi'} \sup_{h_a^T} \log \frac{\sup_{\pi \in \Pi} \Pr(h_a^T \mid \pi)}{\Pr(h_a^T \mid \pi')} .$$



Using the explicit solution to the minimax problem due to Shtarkov [27],
Cesa-Bianchi and Lugosi prove the following theorem:

Theorem (Cesa-Bianchi and Lugosi [4], Theorem 3). For any policy class $\Pi$:



where the covering number and metric entropy for the class $\Pi$ are defined using the
distance measure $D_\infty(\pi, \pi') = \sup_{a \in A} |\log \Pr(a \mid \pi) - \log \Pr(a \mid \pi')|$.

It is now easy to relate the problem of bounding the likelihood ratio to the
worst case regret. Intuitively, we are asking what is the worst case likelihood ratio
if we have the optimal sampling policy. Optimality means that our sampling
policy will induce action sequences with probabilities close to the estimated
policies. Remember that the likelihood ratio depends only on the action sequence $h_a$ in
the history $h$ according to Lemma 1. We need to upper bound the maximum
value of the ratio $\frac{\Pr(h_a \mid \pi)}{\Pr(h_a \mid \pi')}$, which corresponds to $\inf_{\pi'} \sup_{h_a} \left( \frac{\Pr(h_a \mid \pi)}{\Pr(h_a \mid \pi')} \right)$.





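Lemma 5. By the definition of the maximum likelihood policy $\sup_{\pi \in \Pi} \Pr(h_a \mid \pi)$
we have:

$$\inf_{\pi'} \sup_{h_a} \left( \frac{\Pr(h_a \mid \pi)}{\Pr(h_a \mid \pi')} \right) \le \inf_{\pi'} \sup_{h_a} \left( \frac{\sup_{\pi \in \Pi} \Pr(h_a \mid \pi)}{\Pr(h_a \mid \pi')} \right) .$$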

Henceforth we can directly apply the results of Cesa-Bianchi and Lugosi and
get a bound of $e^{R_T}$. Note the logarithmic dependence of the bound on $R_T$ with
respect to the covering number $\mathcal{N}$. Moreover, since actions $a$ belong to the finite
set of actions $A$, many of the remarks of Cesa-Bianchi and Lugosi regarding
finite alphabets apply [4]. In particular, for most "parametric" classes, which
can be parametrized by a bounded subset of $\mathbb{R}^n$ in some "smooth" way [4], the
metric entropy scales as follows: for some positive constants $k_1$ and $k_2$,
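$$\log \mathcal{N}(\Pi, D, \epsilon) \le k_1 \log \frac{k_2 \sqrt{T}}{\epsilon} .$$

For such policies the minimax regret can be bounded by

$$R_T \le \frac{k_1}{2} \log T + o(\log T) ,$$

which makes the likelihood ratio bound $\eta = O(T^{k_1/2})$. In this case the exponential
dependence on the horizon is eliminated and the sample complexity bound
becomes

$$N = O \left( \frac{V_{max}}{\epsilon} T^{k_1/2} \left( \mathcal{K} + \log(1/\delta) \right) \right) .$$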
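
The link between the worst-case likelihood ratio and the minimax regret can be illustrated on a small finite class. The sketch below relies on Shtarkov's solution: the minimax-optimal distribution is the normalized maximum likelihood distribution, and the worst-case ratio under it equals the normalizing sum. The setting is an assumption made for illustration only: histories are binary action sequences and each policy in the class picks action 1 with a fixed probability. All names are hypothetical.

```python
import math
from itertools import product
from typing import Callable, Sequence


def sequence_prob(actions: Sequence[int], p_one: float) -> float:
    """Probability of a binary action sequence when action 1 has probability p_one."""
    prob = 1.0
    for a in actions:
        prob *= p_one if a == 1 else 1.0 - p_one
    return prob


def max_likelihood(actions: Sequence[int], policy_class: Sequence[float]) -> float:
    """sup over the class of Pr(h_a | pi)."""
    return max(sequence_prob(actions, p) for p in policy_class)


def shtarkov_sum(policy_class: Sequence[float], horizon: int) -> float:
    """Sum over all action sequences of the maximum likelihood; equals exp(R_T)."""
    return sum(max_likelihood(h, policy_class) for h in product((0, 1), repeat=horizon))


def worst_ratio(policy_class: Sequence[float], horizon: int,
                sampling_prob: Callable[[Sequence[int]], float]) -> float:
    """sup over h_a of (sup over the class of Pr(h_a | pi)) / Pr(h_a | pi')."""
    return max(max_likelihood(h, policy_class) / sampling_prob(h)
               for h in product((0, 1), repeat=horizon))


policy_class = (0.2, 0.5, 0.8)
horizon = 6
total = shtarkov_sum(policy_class, horizon)


def nml_prob(h: Sequence[int]) -> float:
    """Normalized maximum likelihood sampling distribution over action sequences."""
    return max_likelihood(h, policy_class) / total


def uniform_prob(h: Sequence[int]) -> float:
    return 0.5 ** len(h)


print(math.log(total),
      worst_ratio(policy_class, horizon, nml_prob),
      worst_ratio(policy_class, horizon, uniform_prob))
```

For this class the worst-case ratio under uniform sampling grows like $1.6^T$, while under the normalized maximum likelihood distribution it never exceeds the number of policies in the class.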



6 Discussion and Future Work 



In this paper, we developed value estimators that utilize data gathered when 
using one policy, to estimate the value of using another policy, resulting in data- 
efficient algorithms. We considered the question of accumulating a sufficient 
experience and gave PAC-style bounds. Note that for these bounds to hold the 
covering number of the class of policies $\Pi$ should be finite.

Armed with Theorem 4 we are ready to answer a very important question:
how to choose among several candidate policy classes. Our reasoning
here is similar to that of the structural risk minimization principle by Vapnik [28].
The intuition is that given very limited data, one might prefer to work with a
primitive class of hypotheses with good confidence, rather than getting lost in a
sophisticated class of hypotheses due to low confidence. Formally, we would have
the following method: given a set of policy classes $\Pi_1, \Pi_2, \ldots$ with corresponding
covering numbers $\mathcal{N}_1, \mathcal{N}_2, \ldots$, a confidence $\delta$ and a number of available samples
$N$, compare error bounds $\epsilon_1, \epsilon_2, \ldots$ according to Theorem 4. Another way to
utilize the result of Theorem 4 is to find the minimal experience
necessary to be able to provide the estimate for any policy in the class with a given
confidence. This work also provides insight for a new optimization technique.
Given the value estimate, the number of samples used, and the covering number
of the policy class, one can search for optimal policies in a class using a new cost
function $\hat{V}(\pi) + \Phi(\mathcal{N}, \delta, N) \le V(\pi)$. This is similar in spirit to using structural
risk minimization instead of empirical risk minimization.
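
Both uses of the bounds described above (comparing candidate classes and finding the minimal experience) can be sketched in a few lines. The sketch does not use the constants of the uniform convergence theorem, which is stated only up to order. Instead it makes a simplifying assumption that is not in the original text: each class is finite, its metric entropy is the logarithm of its size, and the single-policy deviation bound is extended to the class by a union bound, which replaces $\log(1/\delta)$ with $\log(1/\delta) + \mathcal{K}$. All names and numbers are hypothetical.

```python
import math
from typing import Dict


def uniform_deviation(n_samples: int, delta: float, eta: float,
                      v_max: float, metric_entropy: float) -> float:
    """Single-policy deviation bound with log(1/delta) replaced by
    log(1/delta) + K (union bound over a finite class; an assumption of this sketch)."""
    log_term = math.log(1.0 / delta) + metric_entropy
    root = math.sqrt(2.0 * log_term * (eta - 1.0) + (log_term * eta) ** 2)
    return v_max / n_samples * (log_term * eta + root)


def min_samples(eps: float, delta: float, eta: float,
                v_max: float, metric_entropy: float) -> int:
    """Smallest N for which uniform_deviation is at most eps (the bound scales as 1/N)."""
    per_sample = uniform_deviation(1, delta, eta, v_max, metric_entropy)
    return math.ceil(per_sample / eps)


def compare_classes(entropies: Dict[str, float], n_samples: int, delta: float,
                    eta: float, v_max: float) -> Dict[str, float]:
    """Error bound for each candidate policy class, keyed by class name."""
    return {name: uniform_deviation(n_samples, delta, eta, v_max, k)
            for name, k in entropies.items()}


candidates = {"small": math.log(10), "large": math.log(1e6)}
print(compare_classes(candidates, n_samples=5000, delta=0.05, eta=16.0, v_max=1.0))
print(min_samples(eps=0.1, delta=0.05, eta=16.0, v_max=1.0, metric_entropy=math.log(10)))
```

With a fixed budget of samples the richer class receives a looser guarantee, which is the trade-off that motivates preferring a simpler class when data are scarce.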

The capacity of the class of policies is measured by bounds on covering num- 
bers in our work or by VC-dimension in the work of Kearns et al. [9]. The worst 
case assumptions of these bounds often make them far too loose for practical use. 
An alternative would be to use more empirical or data dependent measures of 
capacity, e.g. the empirical VC dimension [29] or maximal discrepancy penalties 
on splits of data [1], which tend to give more accurate results. 

We are currently working on extending our results to the weighted
importance sampling (WIS) estimator [23, 26], which is a biased but consistent estimator
and has a better variance for the case of a small number of samples. This can be
done using martingale inequalities by McDiarmid [12] to parallel Bernstein's
result. There is room for employing various alternative sampling techniques, in
order to approximate the optimal sampling policy, for example one might want 
to interrupt uninformative histories, which do not bring any return for a while. 
Another place for algorithm sophistication is sample pruning for the case when 
the set of histories gets large. A few most representative samples can reduce the 
computational cost of estimation. 